Data frames

Data frames are essentially the tables you know from Excel or from anywhere on the internet. A data.frame is usually the product of a long effort to preprocess and clean the data. To combine what we already know: a data.frame is a list of vectors of the same length, with added functionality to easily access rows of data across multiple vectors.

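Data frame columns MUST have the same length; missing values are filled in with NA (or NaN for numeric results that are not a number). NULL cannot serve as a placeholder, because it is dropped when a vector is built. And similarly to the vector constraint, each column can hold only a single variable type.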
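A minimal sketch of both constraints, using made-up vectors (the names `name`, `height` and `people` are only illustrative and not part of the lecture data): a missing measurement is stored as NA, every column keeps one type, and vectors whose lengths do not fit together are refused.

```r
# Hypothetical toy data; names are for illustration only
name   <- c("Ann", "Bob", "Cid")
height <- c(170, NA, 182)          # NA marks the missing measurement

people <- data.frame(name = name, height = height)

# Each column is a vector with a single type
col_types <- sapply(people, class)
print(col_types)

# Columns of different lengths are refused (3 values vs. 2 values)
result <- tryCatch(
  data.frame(name = name, height = c(170, 182)),
  error = function(e) "error: differing number of rows"
)
print(result)
```

Note that data.frame() recycles a shorter vector when its length divides the longer one evenly, so the safest habit is to pad with NA explicitly.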



In [32]:

    
set.seed(1)
age = sample(c(10:25), 25, replace = T)
gender = sample(c("male", "female"), 25, replace = T)
smoker = sample(c(T, F), 25, replace = T)
BMI = rnorm(25, 20, 2)

df = data.frame(age = age, gender = gender, smoker = smoker, BMI = BMI)

There are some simple functions to examine data.frames:



In [33]:

    
head(df)









    





    age  gender  smoker  BMI
1   14   male    1       22.4766082017068
2   15   male    0       19.4413074362915
3   19   male    1       23.5158061796214
4   24   female  1       21.1214921817761
5   13   male    1       19.0944320548937
6   24   male    1       18.3359134077643



In [34]:

    
summary(df)









    





      age           gender     smoker             BMI       
 Min.   :10.00   female:14   Mode :logical   Min.   :15.55  
 1st Qu.:14.00   male  :11   FALSE:8         1st Qu.:18.34  
 Median :19.00               TRUE :17        Median :19.65  
 Mean   :18.04               NA's :0         Mean   :19.84  
 3rd Qu.:22.00                               3rd Qu.:21.12  
 Max.   :25.00                               Max.   :24.88



In [35]:

    
nrow(df)
ncol(df)
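A few more base R helpers are often used for the same purpose. The sketch below runs them on a small made-up data frame (`toy` is a hypothetical name) so that it can be tried on its own.

```r
# Hypothetical toy data frame
toy <- data.frame(age = c(17, 23, 25), weight = c(65, 87, 74))

str(toy)      # compact overview: one line per column with its type
dim(toy)      # c(rows, columns), same as c(nrow(toy), ncol(toy))
names(toy)    # column names
tail(toy, 2)  # last two rows, the counterpart of head()
```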

Columns

Remember that each column is basically a vector. Therefore, if you select the vector, you can run any function on it. It is also important to know the different types of list subsetting. Single [n] selects the n-th element of a list WITH its name and keeps the container, therefore it does not return a vector per se but a one-column data frame. Double [[n]], on the other hand, returns the element itself, here the plain vector.



In [36]:

    
df[3]
df[[3]]









    





smoker

	1 TRUE
	2 FALSE
	3 TRUE
	4 TRUE
	5 TRUE
	6 TRUE
	7 TRUE
	8 FALSE
	9 FALSE
	10 TRUE
	11 FALSE
	12 TRUE
	13 TRUE
	14 TRUE
	15 FALSE
	16 TRUE
	17 TRUE
	18 FALSE
	19 TRUE
	20 FALSE
	21 TRUE
	22 FALSE
	23 TRUE
	24 TRUE
	25 TRUE









    





	TRUE
	FALSE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	FALSE
	FALSE
	TRUE
	FALSE
	TRUE
	TRUE
	TRUE
	FALSE
	TRUE
	TRUE
	FALSE
	TRUE
	FALSE
	TRUE
	FALSE
	TRUE
	TRUE
	TRUE
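The difference between the two results is easiest to see by checking their classes. A small sketch on a made-up data frame (`toy` is a hypothetical name):

```r
toy <- data.frame(age = c(17, 23, 25), smoker = c(TRUE, TRUE, FALSE))

one_col_df <- toy[2]     # still a data frame with one column
plain_vec  <- toy[[2]]   # the logical vector itself

class(one_col_df)        # "data.frame"
class(plain_vec)         # "logical"

# Vector functions work directly on the [[ ]] result
sum(plain_vec)           # number of smokers: 2
```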

Another way of selecting vectors is to follow the list way of selecting elements by name, using the $ operator. This selection is effectively the same as the selection with [[n]]. But remember that if you want to use the name of the column in brackets, you need to put a string there, [["smoker"]] (otherwise R will search for a variable named smoker).



In [37]:

    
df$smoker
df[["smoker"]]
df[["smoker"]] == df$smoker









    





	TRUE
	FALSE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	FALSE
	FALSE
	TRUE
	FALSE
	TRUE
	TRUE
	TRUE
	FALSE
	TRUE
	TRUE
	FALSE
	TRUE
	FALSE
	TRUE
	FALSE
	TRUE
	TRUE
	TRUE








    





	TRUE
	FALSE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	FALSE
	FALSE
	TRUE
	FALSE
	TRUE
	TRUE
	TRUE
	FALSE
	TRUE
	TRUE
	FALSE
	TRUE
	FALSE
	TRUE
	FALSE
	TRUE
	TRUE
	TRUE








    





	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
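One practical consequence, sketched here on a made-up data frame (`toy` and `col` are hypothetical names): when the column name is stored in a variable, [[ ]] evaluates the variable while $ takes the name literally.

```r
toy <- data.frame(age = c(17, 23, 25), smoker = c(TRUE, TRUE, FALSE))

col <- "smoker"             # column name stored in a variable

by_dollar   <- toy$smoker
by_brackets <- toy[[col]]   # [[ ]] evaluates the variable

identical(by_dollar, by_brackets)   # TRUE

# $ takes the name literally: it looks for a column called "col"
is.null(toy$col)                    # TRUE, there is no such column
```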

And the data.frame's own way to select columns is its df[ROW, COLUMN] syntax. The column part accepts numbers as well as strings.



In [38]:

    
df[,3]
df[,"smoker"]









    





	TRUE
	FALSE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	FALSE
	FALSE
	TRUE
	FALSE
	TRUE
	TRUE
	TRUE
	FALSE
	TRUE
	TRUE
	FALSE
	TRUE
	FALSE
	TRUE
	FALSE
	TRUE
	TRUE
	TRUE








    





	TRUE
	FALSE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	FALSE
	FALSE
	TRUE
	FALSE
	TRUE
	TRUE
	TRUE
	FALSE
	TRUE
	TRUE
	FALSE
	TRUE
	FALSE
	TRUE
	FALSE
	TRUE
	TRUE
	TRUE



In [39]:

    
a = "BMI"
df[, a]









    





	22.4766082017068
	19.4413074362915
	23.5158061796214
	21.1214921817761
	19.0944320548937
	18.3359134077643
	17.6668589058306
	17.8688188392234
	16.872435897858
	22.3130739943004
	21.6640942571448
	19.5453426171505
	20.5322747233442
	19.2465945628327
	24.8827292577892
	18.4093217654893
	19.8902450525768
	20.5002826457083
	21.2364865871325
	19.6547529947083
	15.5521994519801
	17.4727712300588
	20.7174577919427
	19.9779090430687
	18.1187016747628
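The column part also accepts several names at once, and a single column can be kept as a data frame with `drop = FALSE`. A short sketch on a made-up data frame (`toy` is a hypothetical name):

```r
toy <- data.frame(age = c(17, 23, 25),
                  smoker = c(TRUE, TRUE, FALSE),
                  weight = c(65, 87, 74))

toy[, "weight"]                  # one column: simplified to a vector
toy[, "weight", drop = FALSE]    # one column kept as a data frame
toy[, c("age", "weight")]        # several columns: always a data frame
```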

Subsetting

When we talk about subsetting data frames, we usually mean selecting rows while keeping all columns. If you only want to keep some columns, use the techniques presented above.

There are many ways to subset a data frame. The first thing to realise is that a data frame is a list of vectors, therefore we can use functionality similar to what lists have. The df[ROW, COLUMN] syntax will also come in handy. If in doubt, go back to the variables lecture about lists.

Basically we have two major ways of subsetting: using common indexing or using functions.

Indexing

Indexing is possible with either logical vectors or row indices. Imagine the following data frame:

age	smoker	weight
17	yes	65
23	yes	87
25	no	74



In [40]:

    
small_df = data.frame(age = c(17, 23, 25), smoker = c(T, T, F), weight = c(65, 87, 74))

That means you can select the second row in these two ways:



In [41]:

    
small_df[c(F, T, F),]
small_df[2,]









    





    age  smoker  weight
2   23   1       87









    





    age  smoker  weight
2   23   1       87
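The same two styles extend to several rows, to excluding rows, and to single cells. A sketch that rebuilds the three-row data frame under the hypothetical name `small`:

```r
small <- data.frame(age = c(17, 23, 25),
                    smoker = c(TRUE, TRUE, FALSE),
                    weight = c(65, 87, 74))

small[c(1, 3), ]               # rows 1 and 3
small[-2, ]                    # everything except row 2
small[c(TRUE, FALSE, TRUE), ]  # the same rows chosen with a logical vector
small[2, "weight"]             # single cell: row 2, column "weight", i.e. 87
```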

Number indexing



In [42]:

    
age20smoker = which(df$age > 20 & df$smoker) # vector of row indices where both conditions hold
age20smoker
df[age20smoker,]









    





	4
	6
	7
	17
	21








    





     age  gender  smoker  BMI
4    24   female  1       21.1214921817761
6    24   male    1       18.3359134077643
7    25   female  1       17.6668589058306
17   21   female  1       19.8902450525768
21   24   female  1       15.5521994519801

Logical indexing

Logical indexing is much more common, but maybe a bit harder to wrap your head around. It keeps every element whose position in the logical vector is TRUE.



In [43]:

    
numbers = 1:10
keep = rep(c(T,F), 5) # logical mask; avoids reusing the name of the base function log()
numbers
keep
numbers[keep]









    





	1
	2
	3
	4
	5
	6
	7
	8
	9
	10








    





	TRUE
	FALSE
	TRUE
	FALSE
	TRUE
	FALSE
	TRUE
	FALSE
	TRUE
	FALSE








    





	1
	3
	5
	7
	9
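In practice the logical vector is rarely typed by hand; it usually comes from a comparison. A small sketch (the name `is_odd` is only illustrative):

```r
numbers <- 1:10

is_odd <- numbers %% 2 == 1   # logical vector, TRUE for odd values
numbers[is_odd]               # 1 3 5 7 9
numbers[numbers > 7]          # 8 9 10
sum(is_odd)                   # TRUE counts as 1, so this is 5
```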

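You can use a logical vector of the same length as the number of rows to pick rows of a data frame; such a vector is usually built from conditions on the columns.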



In [44]:

    
age20smoker = age > 20 & smoker #creating logical vector
age20smoker
df[age20smoker,]









    





	FALSE
	FALSE
	FALSE
	TRUE
	FALSE
	TRUE
	TRUE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	TRUE
	FALSE
	FALSE
	FALSE
	TRUE
	FALSE
	FALSE
	FALSE
	FALSE








    





     age  gender  smoker  BMI
4    24   female  1       21.1214921817761
6    24   male    1       18.3359134077643
7    25   female  1       17.6668589058306
17   21   female  1       19.8902450525768
21   24   female  1       15.5521994519801
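Number indexing with which() and logical indexing give the same rows here, but they differ once the condition contains NA. A sketch on made-up data with one missing age (`toy` and `cond` are hypothetical names):

```r
toy <- data.frame(age = c(17, NA, 25, 30),
                  smoker = c(TRUE, TRUE, FALSE, TRUE))

cond <- toy$age > 20 & toy$smoker   # FALSE NA FALSE TRUE

toy[cond, ]          # 2 rows: an all-NA row for the NA, plus row 4
toy[which(cond), ]   # 1 row: which() drops the NA and keeps only row 4
```

If the data may contain missing values, which() or an explicit `!is.na()` check avoids the surprising all-NA rows.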



In [45]:

    
select_last = c(rep(F, 24), T)
select_last
df[select_last,]









    





	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	FALSE
	TRUE








    





     age  gender  smoker  BMI
25   14   female  1       18.1187016747628



In [46]:

    
df_smokers = df[smoker,]
df_smokers$BMI
mean(df_smokers$BMI)









    





	22.4766082017068
	23.5158061796214
	21.1214921817761
	19.0944320548937
	18.3359134077643
	17.6668589058306
	22.3130739943004
	19.5453426171505
	20.5322747233442
	19.2465945628327
	18.4093217654893
	19.8902450525768
	21.2364865871325
	15.5521994519801
	20.7174577919427
	19.9779090430687
	18.1187016747628








    




19.8676893056573



In [47]:

    
zeny = gender == "female" # TRUE for female rows
age22 = age > 22          # TRUE for rows with age above 22
zeny22 = zeny & age22     # TRUE only where both conditions hold
df[zeny22,]









    





     age  gender  smoker  BMI
4    24   female  1       21.1214921817761
7    25   female  1       17.6668589058306
18   25   female  0       20.5002826457083
21   24   female  1       15.5521994519801



In [48]:

    
# BMI values of male non-smokers younger than 24; wrap the last line in max() for the maximal BMI
males = gender == "male"
age24 = age < 24
nonsmoker = !smoker
male24nonsmoker = males & age24 & nonsmoker
df[male24nonsmoker,]$BMI









    





	19.4413074362915
	17.8688188392234
	16.872435897858
	24.8827292577892
	17.4727712300588
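To get a single maximal value, the selected BMI vector can be passed to max(). The sketch below does this on a small made-up data frame (`toy` and its values are hypothetical) and also shows base R's subset() as one function-based equivalent of the indexing approach.

```r
toy <- data.frame(gender = c("male", "male", "female", "male"),
                  age    = c(20, 30, 22, 23),
                  smoker = c(FALSE, FALSE, FALSE, TRUE),
                  BMI    = c(19.5, 24.0, 21.0, 18.0))

# Indexing: combine the three conditions into one logical vector
keep <- toy$gender == "male" & toy$age < 24 & !toy$smoker
max_bmi_index <- max(toy[keep, ]$BMI)

# subset() evaluates the condition inside the data frame,
# so columns can be named without the toy$ prefix
max_bmi_subset <- max(subset(toy, gender == "male" & age < 24 & !smoker)$BMI)

max_bmi_index    # 19.5
max_bmi_subset   # 19.5
```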
